12  Week 12: Hypothesis Testing Fundamentals

Introduction to Statistics for Animal Science

Author

AnS 500 - Fall 2025

Published

November 14, 2025

13 Introduction

You’re an animal nutritionist testing a new feed additive that’s supposed to increase daily weight gain in finishing pigs. You’ve conducted a trial with 30 pigs on the new feed and observed an average daily gain of 0.85 kg/day, compared to the historical average of 0.78 kg/day. The difference looks promising, but is it real, or could it have occurred by chance?

This is the fundamental question that hypothesis testing helps us answer. This week, we’ll develop a formal framework for making decisions about whether observed differences in data represent true effects or simply random variation.

Key Questions We’ll Address:

  • How do we structure hypotheses about our data?
  • What types of errors can we make, and how do we manage them?
  • When should we use one-sample vs two-sample vs paired t-tests?
  • How do we check whether our assumptions are met?
  • What is statistical power, and why should we care?

By the end of this lecture, you’ll be able to conduct and interpret t-tests appropriately, check assumptions, calculate effect sizes, and understand the limitations of hypothesis testing.

Note: Building on Previous Weeks

We’ve already covered:

  • Week 1: P-values, study design, and the logic of statistical inference
  • Week 2: Descriptive statistics and distributions
  • Week 3: Probability distributions, Central Limit Theorem, and confidence intervals

This week, we formalize these ideas into hypothesis testing procedures.

14 The Logic of Hypothesis Testing

14.1 The Hypothesis Testing Framework

Hypothesis testing follows a structured approach:

  1. State hypotheses: Define null (H₀) and alternative (H₁) hypotheses
  2. Choose significance level: Typically α = 0.05
  3. Collect data and calculate a test statistic
  4. Calculate p-value: Probability of observing data this extreme under H₀
  5. Make decision: Reject H₀ if p < α, otherwise fail to reject H₀
  6. Interpret in context with effect sizes and confidence intervals
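These six steps can be run end-to-end in R. Here is a minimal sketch using simulated data for the feed additive example (the numbers are illustrative, not real trial data):

```r
# Steps 1-2: H0: mu = 0.78, H1: mu != 0.78, alpha = 0.05
alpha <- 0.05

# Step 3: collect data (simulated here) and compute the test statistic
set.seed(42)
gains <- rnorm(30, mean = 0.85, sd = 0.12)  # 30 pigs on the new additive
result <- t.test(gains, mu = 0.78)

# Step 4: p-value under H0
result$p.value

# Step 5: decision rule
reject_h0 <- result$p.value < alpha

# Step 6: interpret in context with the effect estimate and 95% CI
result$estimate - 0.78   # estimated change in daily gain (kg/day)
result$conf.int
```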

Important: Hypothesis Testing is NOT Absolute Truth

Hypothesis testing doesn’t tell us:

  • Whether H₀ is true
  • Whether H₁ is true
  • The probability we’ve made a mistake

It tells us: how compatible our data are with the null hypothesis.

14.2 Null and Alternative Hypotheses

The null hypothesis (H₀) represents the status quo, no effect, or no difference. It’s the hypothesis we try to find evidence against.

The alternative hypothesis (H₁ or Hₐ) represents what we’re trying to find evidence for—usually that there is an effect or difference.

Example: Feed Additive Trial

  • H₀: The new feed additive has no effect on daily weight gain (μ = 0.78 kg/day)
  • H₁: The new feed additive changes daily weight gain (μ ≠ 0.78 kg/day)

This is a two-sided test because we’re open to the additive increasing or decreasing weight gain.

We could also formulate one-sided tests:

  • H₁: μ > 0.78 (additive increases gain)
  • H₁: μ < 0.78 (additive decreases gain)

Warning: One-Sided vs Two-Sided Tests

Use two-sided tests by default. One-sided tests are only appropriate when:

  1. You have strong a priori reasons to test in only one direction
  2. A difference in the other direction would be meaningless or impossible
  3. You specified the direction before collecting data

Don’t choose one-sided tests just to get p < 0.05!
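In R, the direction of the test is set through `t.test()`'s `alternative` argument; a quick sketch with illustrative values:

```r
x <- c(0.81, 0.92, 0.77, 0.88, 0.85, 0.90)  # illustrative daily gains (kg/day)

two_sided <- t.test(x, mu = 0.78, alternative = "two.sided")  # the default
greater   <- t.test(x, mu = 0.78, alternative = "greater")    # H1: mu > 0.78
less      <- t.test(x, mu = 0.78, alternative = "less")       # H1: mu < 0.78

# When the observed effect lies in the hypothesized direction, the one-sided
# p-value is exactly half the two-sided p-value -- which is why choosing a
# direction after seeing the data inflates false positives
c(two_sided$p.value, greater$p.value, less$p.value)
```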

14.3 Visual Intuition: Is There a Difference?

Let’s visualize what hypothesis testing is trying to determine:

Code
# Packages used throughout: tidyverse (tibble, dplyr, ggplot2, purrr)
# and patchwork (for combining plots with p1 + p2)
library(tidyverse)
library(patchwork)

# Simulate two scenarios
set.seed(123)

# Scenario A: No real difference (H0 is true)
scenario_a <- tibble(
  group = rep(c("Control", "Treatment"), each = 30),
  weight_gain = c(rnorm(30, mean = 0.78, sd = 0.12),
                  rnorm(30, mean = 0.78, sd = 0.12))
)

# Scenario B: Real difference (H0 is false)
scenario_b <- tibble(
  group = rep(c("Control", "Treatment"), each = 30),
  weight_gain = c(rnorm(30, mean = 0.78, sd = 0.12),
                  rnorm(30, mean = 0.88, sd = 0.12))
)

# Plot both scenarios
p1 <- ggplot(scenario_a, aes(x = group, y = weight_gain, fill = group)) +
  geom_boxplot(alpha = 0.6, outlier.shape = NA) +
  geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
  scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
  labs(title = "Scenario A: No True Difference",
       subtitle = "Both groups sampled from same distribution",
       y = "Daily Weight Gain (kg)",
       x = "") +
  theme(legend.position = "none") +
  ylim(0.4, 1.2)

p2 <- ggplot(scenario_b, aes(x = group, y = weight_gain, fill = group)) +
  geom_boxplot(alpha = 0.6, outlier.shape = NA) +
  geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
  scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
  labs(title = "Scenario B: Real Difference",
       subtitle = "Treatment group has higher mean",
       y = "Daily Weight Gain (kg)",
       x = "") +
  theme(legend.position = "none") +
  ylim(0.4, 1.2)

p1 + p2

The challenge: Even when there’s no real difference (Scenario A), we still see some difference in sample means due to random variation. Hypothesis testing helps us quantify whether the observed difference is larger than we’d expect by chance alone.

Code
# Calculate means and run t-tests
scenario_a_summary <- scenario_a %>%
  group_by(group) %>%
  summarise(mean_gain = mean(weight_gain), .groups = 'drop')

scenario_b_summary <- scenario_b %>%
  group_by(group) %>%
  summarise(mean_gain = mean(weight_gain), .groups = 'drop')

test_a <- t.test(weight_gain ~ group, data = scenario_a)
test_b <- t.test(weight_gain ~ group, data = scenario_b)

cat("Scenario A (no true difference):\n")
Scenario A (no true difference):
Code
cat(sprintf("  Control: %.3f kg/day, Treatment: %.3f kg/day\n",
            scenario_a_summary$mean_gain[1], scenario_a_summary$mean_gain[2]))
  Control: 0.774 kg/day, Treatment: 0.801 kg/day
Code
cat(sprintf("  Difference: %.3f kg/day, p-value: %.3f\n\n",
            diff(scenario_a_summary$mean_gain), test_a$p.value))
  Difference: 0.027 kg/day, p-value: 0.342
Code
cat("Scenario B (true difference = 0.10 kg/day):\n")
Scenario B (true difference = 0.10 kg/day):
Code
cat(sprintf("  Control: %.3f kg/day, Treatment: %.3f kg/day\n",
            scenario_b_summary$mean_gain[1], scenario_b_summary$mean_gain[2]))
  Control: 0.783 kg/day, Treatment: 0.869 kg/day
Code
cat(sprintf("  Difference: %.3f kg/day, p-value: %.4f\n",
            diff(scenario_b_summary$mean_gain), test_b$p.value))
  Difference: 0.086 kg/day, p-value: 0.0028

In Scenario A, we correctly fail to reject H₀ (p > 0.05). In Scenario B, we correctly reject H₀ (p < 0.05) and conclude there’s evidence of a real difference.

15 Type I and Type II Errors

When we make decisions based on hypothesis tests, we can make two types of mistakes.

15.1 Definitions

Type I Error (False Positive, α): Rejecting H₀ when it’s actually true

  • Concluding there’s an effect when there isn’t one
  • The probability of Type I error is α (significance level)
  • Typically set at α = 0.05 (5% false positive rate)

Type II Error (False Negative, β): Failing to reject H₀ when it’s actually false

  • Concluding there’s no effect when there is one
  • The probability of Type II error is β
  • Power = 1 - β (probability of correctly rejecting a false H₀)

15.2 The Truth Table

                     H₀ is True         H₀ is False
Reject H₀            Type I Error (α)   ✓ Correct (Power)
Fail to Reject H₀    ✓ Correct (1-α)    Type II Error (β)

Note: Why α = 0.05?

The 0.05 threshold is a convention, not a law of nature. R.A. Fisher suggested it as a convenient benchmark, but it’s arbitrary. Some fields use:

  • α = 0.01 for more stringent control of false positives
  • α = 0.10 for exploratory research where false negatives are more costly

Focus on effect sizes and confidence intervals, not just whether p < 0.05.

15.3 Consequences in Animal Science

The consequences of these errors depend on context:

Example 1: New Feed Additive

  • Type I Error: Conclude additive works when it doesn’t → waste money on ineffective product
  • Type II Error: Conclude additive doesn’t work when it does → miss opportunity to improve productivity

Example 2: Disease Screening

  • Type I Error: Conclude animal is diseased when it’s healthy → unnecessary treatment, stress, cost
  • Type II Error: Conclude animal is healthy when it’s diseased → disease spreads, welfare issues

The relative costs of these errors should inform your choice of α and sample size (which affects β).

15.4 Simulating Type I Errors

Let’s demonstrate that when H₀ is true, we still get p < 0.05 about 5% of the time:

Code
# Function to run one trial where H0 is TRUE
run_null_true_trial <- function() {
  # Both groups from same distribution (no real difference)
  control <- rnorm(25, mean = 100, sd = 15)
  treatment <- rnorm(25, mean = 100, sd = 15)  # Same mean!

  test_result <- t.test(treatment, control)
  test_result$p.value
}

# Run 1000 trials
n_sims <- 1000
p_values_null_true <- replicate(n_sims, run_null_true_trial())

# Visualize
tibble(p_value = p_values_null_true) %>%
  ggplot(aes(x = p_value)) +
  geom_histogram(bins = 20, fill = "steelblue", alpha = 0.7, color = "white") +
  geom_vline(xintercept = 0.05, color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of P-values When H₀ is True",
       subtitle = sprintf("Proportion with p < 0.05: %.3f (expected: 0.05)",
                         mean(p_values_null_true < 0.05)),
       x = "P-value",
       y = "Count") +
  theme_minimal(base_size = 12)

Code
type_i_rate <- mean(p_values_null_true < 0.05)
cat(sprintf("Type I error rate: %.3f (%.1f%%)\n", type_i_rate, type_i_rate * 100))
Type I error rate: 0.040 (4.0%)
Code
cat(sprintf("Out of %d trials: %d false positives\n",
            n_sims, sum(p_values_null_true < 0.05)))
Out of 1000 trials: 40 false positives

Key insight: Even when there’s no real effect, we’ll get “significant” results about 5% of the time. This is why replication is crucial in science!

15.5 Simulating Type II Errors

Now let’s see Type II errors—when there IS a real effect, but we fail to detect it:

Code
# Function to run one trial where H0 is FALSE (real effect exists)
run_null_false_trial <- function(true_effect = 10, sample_size = 20, sd = 15) {
  control <- rnorm(sample_size, mean = 100, sd = sd)
  treatment <- rnorm(sample_size, mean = 100 + true_effect, sd = sd)

  test_result <- t.test(treatment, control)
  test_result$p.value
}

# Small effect, small sample
p_values_small <- replicate(n_sims, run_null_false_trial(true_effect = 5, sample_size = 20))

# Medium effect, small sample
p_values_medium <- replicate(n_sims, run_null_false_trial(true_effect = 10, sample_size = 20))

# Medium effect, large sample
p_values_large_n <- replicate(n_sims, run_null_false_trial(true_effect = 10, sample_size = 50))

# Combine results
error_rates <- tibble(
  Scenario = c("Small effect (d=0.33), n=20",
               "Medium effect (d=0.67), n=20",
               "Medium effect (d=0.67), n=50"),
  `Type II Error Rate (β)` = c(mean(p_values_small >= 0.05),
                                 mean(p_values_medium >= 0.05),
                                 mean(p_values_large_n >= 0.05)),
  `Power (1-β)` = 1 - `Type II Error Rate (β)`
)

knitr::kable(error_rates, digits = 3, align = 'lcc',
             caption = "Type II Error Rates Under Different Scenarios")
Type II Error Rates Under Different Scenarios
Scenario                        Type II Error Rate (β)   Power (1-β)
Small effect (d=0.33), n=20                      0.831         0.169
Medium effect (d=0.67), n=20                     0.474         0.526
Medium effect (d=0.67), n=50                     0.093         0.907

Key insights:

  1. Smaller effects are harder to detect (higher β)
  2. Larger samples reduce Type II error (increase power)
  3. Even with a real effect, we often fail to detect it with small samples!

16 Statistical Power

Power is the probability of correctly rejecting a false null hypothesis. In other words, it’s the probability of detecting an effect when one truly exists.

\[\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ is false})\]

16.1 Factors Affecting Power

Power depends on four factors:

  1. Effect size: Larger effects are easier to detect
  2. Sample size (n): More data = more power
  3. Significance level (α): Lower α = lower power (trade-off with Type I error)
  4. Variability (σ): Less noisy data = more power

Tip: Typical Power Target

Many researchers aim for power = 0.80 (80% probability of detecting the effect). This means accepting a 20% Type II error rate (β = 0.20).

The choice of 80% is conventional, like α = 0.05. In some contexts (e.g., clinical trials), you might want higher power (0.90 or 0.95).

16.2 Visualizing Power

Let’s visualize how power changes with sample size and effect size:

Code
# Function to calculate power via simulation
calculate_power_sim <- function(n, effect_size, sd = 15, alpha = 0.05, n_sims = 1000) {
  p_values <- replicate(n_sims, {
    control <- rnorm(n, mean = 100, sd = sd)
    treatment <- rnorm(n, mean = 100 + effect_size, sd = sd)
    t.test(treatment, control)$p.value
  })
  mean(p_values < alpha)
}

# Calculate power for different scenarios
power_data <- expand_grid(
  n = seq(10, 100, by = 5),
  effect_size = c(5, 10, 15, 20)
) %>%
  mutate(
    effect_label = sprintf("Effect = %d (d = %.2f)", effect_size, effect_size / 15),
    power = map2_dbl(n, effect_size, ~calculate_power_sim(.x, .y, n_sims = 500))
  )

# Plot power curves
ggplot(power_data, aes(x = n, y = power, color = effect_label)) +
  geom_line(linewidth = 1.2) +
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "gray40") +
  geom_hline(yintercept = 0.05, linetype = "dotted", color = "gray60") +
  annotate("text", x = 90, y = 0.82, label = "Target power = 0.80", size = 3) +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Statistical Power vs Sample Size",
       subtitle = "For two-sample t-test with SD = 15, α = 0.05",
       x = "Sample Size per Group",
       y = "Power (1 - β)",
       color = "Effect Size") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

Key insights from power curves:

  1. Small effects need large samples: For a small effect (d = 0.33), even n = 100 per group falls short of 80% power; roughly n ≈ 140 per group is needed
  2. Large effects need small samples: For a large effect (d = 1.33), n ≈ 10 per group is sufficient
  3. Diminishing returns: Going from n=20 to n=40 adds more power than going from n=60 to n=80

16.3 Sample Size Planning

Before conducting a study, you should estimate the required sample size to achieve adequate power. This requires:

  1. Specify α: Usually 0.05
  2. Specify desired power: Usually 0.80
  3. Estimate effect size: Based on pilot data or literature
  4. Estimate variability: SD or variance

Example: Feed Additive Study

Suppose we expect the feed additive to increase daily gain by 0.08 kg (from 0.78 to 0.86 kg/day), and we know the SD ≈ 0.12 kg/day from previous trials. How many pigs do we need?

Code
# Calculate Cohen's d
expected_effect <- 0.08
expected_sd <- 0.12
cohens_d <- expected_effect / expected_sd

cat(sprintf("Expected Cohen's d: %.2f\n", cohens_d))
Expected Cohen's d: 0.67
Code
# Use power.t.test for analytical power calculation
power_analysis <- power.t.test(
  delta = expected_effect,    # Difference in means
  sd = expected_sd,           # Standard deviation
  sig.level = 0.05,           # α
  power = 0.80,               # Desired power
  type = "two.sample",
  alternative = "two.sided"
)

print(power_analysis)

     Two-sample t test power calculation 

              n = 36.3058
          delta = 0.08
             sd = 0.12
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
Code
cat(sprintf("\nRequired sample size: %.0f pigs per group\n", ceiling(power_analysis$n)))

Required sample size: 37 pigs per group
Code
cat(sprintf("Total pigs needed: %.0f\n", 2 * ceiling(power_analysis$n)))
Total pigs needed: 74

Interpretation: We need 37 pigs per group (74 total, rounding up from the computed n ≈ 36.3) to have 80% power to detect a difference of 0.08 kg/day.

Important: Power Analysis Should Be Done BEFORE Data Collection

Conducting power analysis after your study doesn’t change anything about your results. Power analysis is most useful for:

  1. Study planning: Determine required sample size before collecting data
  2. Interpreting non-significant results: A non-significant result from an underpowered study is uninformative
  3. Evaluating published research: Were studies adequately powered to detect realistic effects?

“Post-hoc power analysis” of your own data is generally not recommended.

17 One-Sample t-Test

A one-sample t-test compares a sample mean to a known or hypothesized population value.

Research question: Does the sample mean differ from a specific value?

Example: A dairy nutritionist wants to know if a new feeding program affects milk production. Historical data shows the average milk production is 35 kg/day. After implementing the new program, they measure milk production in 25 cows.
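In symbols, the one-sample t statistic scales the difference between the sample mean and the hypothesized value by the standard error of the mean:

\[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad \text{df} = n - 1\]

where \(\bar{x}\) is the sample mean, \(s\) the sample standard deviation, and \(\mu_0\) the hypothesized value (here, 35 kg/day). Under H₀, this statistic follows a t-distribution with n − 1 degrees of freedom.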

17.1 Assumptions

  1. Independence: Observations are independent of each other
  2. Normality: The data (or sampling distribution of means) is approximately normal

Note: Central Limit Theorem to the Rescue

The t-test is fairly robust to violations of normality when:

  • Sample size is moderate to large (n > 30 as a rule of thumb)
  • The distribution isn’t extremely skewed
  • There are no extreme outliers

For small samples (n < 30), normality is more important.

17.2 Worked Example: Milk Production

Code
# Simulate milk production data
set.seed(456)
milk_production <- tibble(
  cow_id = 1:25,
  milk_kg = rnorm(25, mean = 37.2, sd = 4.5)  # True mean = 37.2 (higher than historical 35)
)

# Summary statistics
milk_summary <- milk_production %>%
  summarise(
    n = n(),
    mean = mean(milk_kg),
    sd = sd(milk_kg),
    se = sd / sqrt(n),
    median = median(milk_kg),
    min = min(milk_kg),
    max = max(milk_kg)
  )

knitr::kable(milk_summary, digits = 2,
             caption = "Summary Statistics: Milk Production (kg/day)")
Summary Statistics: Milk Production (kg/day)
 n    mean     sd     se   median     min      max
25   38.32   5.34   1.07    38.94   29.47    47.46

Hypotheses:

  • H₀: μ = 35 kg/day (no change from historical average)
  • H₁: μ ≠ 35 kg/day (production has changed)

17.2.1 Step 1: Check Assumptions

Code
# Visual check: Histogram and QQ plot
p1 <- ggplot(milk_production, aes(x = milk_kg)) +
  geom_histogram(bins = 10, fill = "steelblue", alpha = 0.7, color = "white") +
  geom_vline(xintercept = 35, color = "red", linetype = "dashed", linewidth = 1) +
  annotate("text", x = 35, y = 5.5, label = "Historical\nmean = 35",
           color = "red", hjust = -0.1, size = 3) +
  labs(title = "Distribution of Milk Production",
       x = "Milk Production (kg/day)",
       y = "Count") +
  theme_minimal(base_size = 11)

p2 <- ggplot(milk_production, aes(sample = milk_kg)) +
  stat_qq(color = "steelblue", size = 2) +
  stat_qq_line(color = "red", linetype = "dashed") +
  labs(title = "Q-Q Plot",
       subtitle = "Checking normality assumption",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal(base_size = 11)

p1 + p2

Visual assessment: The histogram looks reasonably symmetric, and the QQ plot shows points close to the line → normality assumption seems reasonable.

Formal test: Shapiro-Wilk test

Code
shapiro_result <- shapiro.test(milk_production$milk_kg)
cat(sprintf("Shapiro-Wilk test: W = %.4f, p-value = %.3f\n",
            shapiro_result$statistic, shapiro_result$p.value))
Shapiro-Wilk test: W = 0.9544, p-value = 0.315
Code
if(shapiro_result$p.value > 0.05) {
  cat("→ No evidence against normality (p > 0.05)\n")
} else {
  cat("→ Some evidence against normality (p < 0.05)\n")
}
→ No evidence against normality (p > 0.05)

Warning: Don’t Over-Rely on Normality Tests

Shapiro-Wilk and other normality tests can be:

  • Too sensitive with large samples (reject for trivial deviations)
  • Not sensitive enough with small samples (fail to detect important deviations)

Recommendation: Use visual assessment (QQ plots) as your primary tool, and consider the robustness of the t-test.

17.2.2 Step 2: Conduct the t-Test

Code
# One-sample t-test
milk_test <- t.test(milk_production$milk_kg, mu = 35)

# Tidy output (tidy() comes from the broom package)
library(broom)
milk_test_tidy <- tidy(milk_test)

knitr::kable(milk_test_tidy, digits = 3,
             caption = "One-Sample t-Test Results")
One-Sample t-Test Results
estimate   statistic   p.value   parameter   conf.low   conf.high   method              alternative
  38.319       3.105     0.005          24     36.113      40.524   One Sample t-test   two.sided
Code
# Print results
cat(sprintf("\nOne-Sample t-Test Results:\n"))

One-Sample t-Test Results:
Code
cat(sprintf("Sample mean: %.2f kg/day\n", milk_test$estimate))
Sample mean: 38.32 kg/day
Code
cat(sprintf("Hypothesized mean: %.2f kg/day\n", 35))
Hypothesized mean: 35.00 kg/day
Code
cat(sprintf("Difference: %.2f kg/day\n", milk_test$estimate - 35))
Difference: 3.32 kg/day
Code
cat(sprintf("t-statistic: %.3f\n", milk_test$statistic))
t-statistic: 3.105
Code
cat(sprintf("Degrees of freedom: %d\n", milk_test$parameter))
Degrees of freedom: 24
Code
cat(sprintf("P-value: %.4f\n", milk_test$p.value))
P-value: 0.0048
Code
cat(sprintf("95%% CI: [%.2f, %.2f]\n", milk_test$conf.int[1], milk_test$conf.int[2]))
95% CI: [36.11, 40.52]

17.2.3 Step 3: Interpret Results

Code
# Visualize result with confidence interval
ggplot(milk_production, aes(x = "Sample", y = milk_kg)) +
  geom_jitter(width = 0.1, alpha = 0.4, size = 2, color = "steelblue") +
  geom_hline(yintercept = 35, color = "red", linetype = "dashed", linewidth = 1) +
  stat_summary(fun = mean, geom = "point", size = 4, color = "darkblue") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar",  # mean_cl_normal requires Hmisc
               width = 0.2, color = "darkblue", linewidth = 1) +
  annotate("text", x = 1.3, y = 35, label = "Historical mean\n(H₀: μ = 35)",
           color = "red", hjust = 0, size = 3) +
  annotate("text", x = 1.3, y = milk_test$estimate,
           label = sprintf("Sample mean\n%.1f kg/day", milk_test$estimate),
           color = "darkblue", hjust = 0, size = 3) +
  labs(title = "Milk Production: Sample vs Historical Mean",
       subtitle = sprintf("95%% CI: [%.1f, %.1f], p = %.4f",
                         milk_test$conf.int[1], milk_test$conf.int[2], milk_test$p.value),
       x = "",
       y = "Milk Production (kg/day)") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_blank())

Statistical conclusion: We reject H₀ (p = 0.005). There is strong evidence that milk production under the new feeding program differs from the historical average of 35 kg/day.

Practical interpretation: The new feeding program is associated with an increase of approximately 3.3 kg/day (95% CI: [1.1, 5.5]). This is both statistically significant and potentially economically meaningful.

17.3 Effect Size for One-Sample t-Test

Code
# Calculate Cohen's d for one-sample test
cohens_d_one_sample <- (milk_test$estimate - 35) / milk_summary$sd

cat(sprintf("Cohen's d: %.3f\n", cohens_d_one_sample))
Cohen's d: 0.621
Code
# Interpretation
if(abs(cohens_d_one_sample) < 0.2) {
  interpretation <- "negligible"
} else if(abs(cohens_d_one_sample) < 0.5) {
  interpretation <- "small"
} else if(abs(cohens_d_one_sample) < 0.8) {
  interpretation <- "medium"
} else {
  interpretation <- "large"
}

cat(sprintf("Effect size interpretation: %s\n", interpretation))
Effect size interpretation: medium

Cohen’s d guidelines (rough benchmarks):

  • d = 0.2: Small effect
  • d = 0.5: Medium effect
  • d = 0.8: Large effect

Our effect size is medium, indicating a substantial difference from the historical mean.

18 Two-Sample t-Test (Independent Samples)

A two-sample t-test compares means between two independent groups.

Research question: Does the mean of Group A differ from the mean of Group B?

Example: An animal scientist wants to compare weight gain in beef cattle fed two different grain supplements (Supplement A vs Supplement B). Forty steers are randomly assigned to one of the two supplements (20 per group).
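In symbols, Welch’s two-sample t statistic (R’s default) standardizes the difference in group means by the combined standard error:

\[t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}\]

The degrees of freedom are then adjusted downward via the Welch–Satterthwaite approximation when the group variances differ; Student’s version instead pools the two variances and uses df = n_A + n_B − 2.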

18.1 Assumptions

  1. Independence: Observations within and between groups are independent
  2. Normality: Data in each group is approximately normally distributed
  3. Equal variances (for standard t-test): The variances in the two groups are equal

Note: Welch’s t-Test: Unequal Variances

If variances are unequal, use Welch’s t-test (the default in R’s t.test()). It adjusts the degrees of freedom to account for unequal variances.

The standard t-test assumes equal variances (Student’s t-test), but Welch’s t-test is more robust and is generally recommended as the default.

18.2 Worked Example: Grain Supplements

Code
# Simulate weight gain data
set.seed(789)
cattle_gain <- tibble(
  supplement = rep(c("A", "B"), each = 20),
  weight_gain_kg = c(
    rnorm(20, mean = 185, sd = 22),  # Supplement A
    rnorm(20, mean = 205, sd = 25)   # Supplement B (higher gain)
  )
)

# Summary statistics by group
cattle_summary <- cattle_gain %>%
  group_by(supplement) %>%
  summarise(
    n = n(),
    mean = mean(weight_gain_kg),
    sd = sd(weight_gain_kg),
    se = sd / sqrt(n),
    median = median(weight_gain_kg),
    .groups = 'drop'
  )

knitr::kable(cattle_summary, digits = 2,
             caption = "Summary Statistics: Weight Gain by Supplement")
Summary Statistics: Weight Gain by Supplement
supplement    n     mean      sd     se   median
A            20   178.18   15.96   3.57   176.60
B            20   197.72   17.84   3.99   196.95

Hypotheses:

  • H₀: μ_A = μ_B (no difference in weight gain between supplements)
  • H₁: μ_A ≠ μ_B (supplements lead to different weight gains)

18.2.1 Step 1: Visualize the Data

Code
# Box plots with individual points
p1 <- ggplot(cattle_gain, aes(x = supplement, y = weight_gain_kg, fill = supplement)) +
  geom_boxplot(alpha = 0.6, outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.5, size = 2) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3,
               fill = "red", color = "darkred") +
  scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
  labs(title = "Weight Gain by Supplement Type",
       subtitle = "Diamonds show group means",
       x = "Supplement",
       y = "Weight Gain (kg)") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

# Density plots
p2 <- ggplot(cattle_gain, aes(x = weight_gain_kg, fill = supplement)) +
  geom_density(alpha = 0.6) +
  geom_vline(data = cattle_summary, aes(xintercept = mean, color = supplement),
             linetype = "dashed", linewidth = 1) +
  scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
  scale_color_manual(values = c("#E69F00", "#56B4E9")) +
  labs(title = "Distribution of Weight Gain",
       subtitle = "Dashed lines show group means",
       x = "Weight Gain (kg)",
       y = "Density") +
  theme_minimal(base_size = 12)

p1 + p2

Visual assessment: Supplement B appears to produce higher weight gain on average, with some overlap between distributions.

18.2.2 Step 2: Check Assumptions

Normality Check (QQ Plots by Group):

Code
ggplot(cattle_gain, aes(sample = weight_gain_kg, color = supplement)) +
  stat_qq(size = 2) +
  stat_qq_line(linetype = "dashed") +
  facet_wrap(~supplement) +
  scale_color_manual(values = c("#E69F00", "#56B4E9")) +
  labs(title = "Q-Q Plots by Supplement",
       subtitle = "Checking normality assumption",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Both groups show reasonably normal distributions.

Equal Variance Check (Levene’s Test):

Code
# Levene's test for homogeneity of variance
# (leveneTest() comes from the car package; the grouping variable is treated as a factor)
library(car)
levene_result <- leveneTest(weight_gain_kg ~ supplement, data = cattle_gain)

cat("Levene's Test for Equality of Variances:\n")
Levene's Test for Equality of Variances:
Code
print(levene_result)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  0.7142 0.4033
      38               
Code
cat(sprintf("\nP-value: %.3f\n", levene_result$`Pr(>F)`[1]))

P-value: 0.403
Code
if(levene_result$`Pr(>F)`[1] > 0.05) {
  cat("→ No evidence of unequal variances (p > 0.05)\n")
  cat("  Standard t-test or Welch's t-test are both appropriate\n")
} else {
  cat("→ Evidence of unequal variances (p < 0.05)\n")
  cat("  Welch's t-test is recommended\n")
}
→ No evidence of unequal variances (p > 0.05)
  Standard t-test or Welch's t-test are both appropriate

Tip: Visualizing Variance Differences

A simple way to compare variances visually:

Code
cattle_gain %>%
  ggplot(aes(x = supplement, y = weight_gain_kg)) +
  geom_boxplot(aes(fill = supplement), alpha = 0.4) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
               geom = "errorbar", width = 0.3, linewidth = 1, color = "darkred") +
  stat_summary(fun = mean, geom = "point", size = 3, color = "darkred") +
  scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
  labs(title = "Mean ± SD by Group",
       subtitle = "Error bars show ±1 SD",
       x = "Supplement",
       y = "Weight Gain (kg)") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

If one group’s error bars are much longer, variances may be unequal.

18.2.3 Step 3: Conduct the t-Test

Code
# Welch's t-test (default, doesn't assume equal variances)
cattle_test_welch <- t.test(weight_gain_kg ~ supplement, data = cattle_gain)

# Student's t-test (assumes equal variances)
cattle_test_student <- t.test(weight_gain_kg ~ supplement, data = cattle_gain, var.equal = TRUE)

# Compare results
cat("Welch's t-Test (equal variances not assumed):\n")
Welch's t-Test (equal variances not assumed):
Code
cat(sprintf("  t(%.2f) = %.3f, p = %.4f\n",
            cattle_test_welch$parameter,
            cattle_test_welch$statistic,
            cattle_test_welch$p.value))
  t(37.54) = -3.651, p = 0.0008
Code
cat(sprintf("  95%% CI for difference: [%.2f, %.2f]\n\n",
            cattle_test_welch$conf.int[1], cattle_test_welch$conf.int[2]))
  95% CI for difference: [-30.38, -8.70]
Code
cat("Student's t-Test (equal variances assumed):\n")
Student's t-Test (equal variances assumed):
Code
cat(sprintf("  t(%d) = %.3f, p = %.4f\n",
            cattle_test_student$parameter,
            cattle_test_student$statistic,
            cattle_test_student$p.value))
  t(38) = -3.651, p = 0.0008
Code
cat(sprintf("  95%% CI for difference: [%.2f, %.2f]\n",
            cattle_test_student$conf.int[1], cattle_test_student$conf.int[2]))
  95% CI for difference: [-30.38, -8.71]

Note: In this case, both tests give similar results because the variances are fairly similar. When in doubt, use Welch’s t-test (the default).

18.2.4 Step 4: Calculate Effect Size

Code
# Calculate Cohen's d (cohen.d() comes from the effsize package)
library(effsize)
cattle_cohen_d <- cohen.d(weight_gain_kg ~ supplement, data = cattle_gain)

print(cattle_cohen_d)

Cohen's d

d estimate: -1.154639 (large)
95 percent confidence interval:
     lower      upper 
-1.8460956 -0.4631817 
Code
cat(sprintf("\nCohen's d: %.3f\n", cattle_cohen_d$estimate))

Cohen's d: -1.155
Code
cat(sprintf("95%% CI for d: [%.3f, %.3f]\n",
            cattle_cohen_d$conf.int[1], cattle_cohen_d$conf.int[2]))
95% CI for d: [-1.846, -0.463]
Code
# Interpretation
d_value <- abs(cattle_cohen_d$estimate)
if(d_value < 0.2) {
  interpretation <- "negligible"
} else if(d_value < 0.5) {
  interpretation <- "small"
} else if(d_value < 0.8) {
  interpretation <- "medium"
} else {
  interpretation <- "large"
}

cat(sprintf("Effect size interpretation: %s\n", interpretation))
Effect size interpretation: large

18.2.5 Step 5: Interpret and Report

Code
# Create a clean visualization for reporting
mean_diff <- diff(cattle_summary$mean)

ggplot(cattle_summary, aes(x = supplement, y = mean, fill = supplement)) +
  geom_col(alpha = 0.7, width = 0.6) +
  geom_errorbar(aes(ymin = mean - se * 1.96, ymax = mean + se * 1.96),
                width = 0.2, linewidth = 1) +
  geom_text(aes(label = sprintf("%.1f kg", mean)),
            vjust = -2.5, fontface = "bold", size = 4) +
  scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
  annotate("segment", x = 1, xend = 2, y = 230, yend = 230,
           arrow = arrow(ends = "both", length = unit(0.2, "cm"))) +
  annotate("text", x = 1.5, y = 235,
           label = sprintf("Difference: %.1f kg\np = %.3f",
                          mean_diff, cattle_test_welch$p.value),
           size = 3.5, fontface = "bold") +
  labs(title = "Weight Gain by Supplement Type",
       subtitle = "Error bars show 95% confidence intervals",
       x = "Supplement",
       y = "Weight Gain (kg)") +
  ylim(0, 250) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Results Summary:

Beef steers receiving Supplement B gained significantly more weight (M = 197.7 kg, SD = 17.8) compared to those receiving Supplement A (M = 178.2 kg, SD = 16.0), t(37.5) = -3.65, p = 0.001, 95% CI [-30.4, -8.7], d = 1.15. This represents a large effect.

Practical interpretation: Supplement B produces approximately 20 kg more weight gain than Supplement A, which could translate to meaningful economic benefits for producers.

19 Paired t-Test

A paired t-test (also called dependent samples t-test) compares means when observations are paired or matched.

When to use paired t-tests:

  • Before-after measurements on the same individuals
  • Matched pairs (e.g., siblings, litter mates)
  • Repeated measures on the same experimental units

Key advantage: Pairing removes between-subject variability, increasing statistical power.

19.1 Why Pairing Matters

When measurements are paired, we’re interested in the differences within pairs, not the absolute values in each group.

Example: Testing a feed supplement by measuring weight gain in the same animals before and after treatment is more powerful than comparing two different groups, because we control for individual variation in baseline weight and genetics.

19.2 Worked Example: Milk Yield Before and After Treatment

A dairy researcher wants to test whether a new probiotic supplement increases milk yield. They measure milk production in 20 cows before supplementation, then again after 4 weeks on the supplement.

Code
# Simulate paired data
set.seed(321)

# Create baseline variation between cows
cow_baseline <- rnorm(20, mean = 30, sd = 5)

# Before treatment: baseline + random day-to-day variation
milk_before <- cow_baseline + rnorm(20, mean = 0, sd = 2)

# After treatment: baseline + treatment effect + random variation
treatment_effect <- 2.5  # True effect = 2.5 kg/day increase
milk_after <- cow_baseline + treatment_effect + rnorm(20, mean = 0, sd = 2)

# Combine into data frame
milk_paired <- tibble(
  cow_id = 1:20,
  before = milk_before,
  after = milk_after,
  difference = after - before
)

# Long format for plotting
milk_paired_long <- milk_paired %>%
  pivot_longer(cols = c(before, after),
               names_to = "time",
               values_to = "milk_yield") %>%
  mutate(time = factor(time, levels = c("before", "after")))

# Summary statistics
milk_paired_summary <- milk_paired_long %>%
  group_by(time) %>%
  summarise(
    n = n(),
    mean = mean(milk_yield),
    sd = sd(milk_yield),
    se = sd / sqrt(n),
    .groups = 'drop'
  )

knitr::kable(milk_paired_summary, digits = 2,
             caption = "Summary Statistics: Milk Yield Before and After Treatment")
Summary Statistics: Milk Yield Before and After Treatment
time n mean sd se
before 20 31.13 5.22 1.17
after 20 34.06 4.66 1.04
Code
# Summary of differences
diff_summary <- milk_paired %>%
  summarise(
    mean_diff = mean(difference),
    sd_diff = sd(difference),
    se_diff = sd_diff / sqrt(n())
  )

cat(sprintf("\nMean difference (after - before): %.2f kg/day\n", diff_summary$mean_diff))

Mean difference (after - before): 2.93 kg/day
Code
cat(sprintf("SD of differences: %.2f kg/day\n", diff_summary$sd_diff))
SD of differences: 2.88 kg/day

19.2.1 Visualizing Paired Data

The key to paired data is visualizing the connections between measurements:

Code
# Paired plot showing connections
p1 <- ggplot(milk_paired_long, aes(x = time, y = milk_yield, group = cow_id)) +
  geom_line(alpha = 0.3, color = "gray50") +
  geom_point(aes(color = time), size = 2, alpha = 0.7) +
  stat_summary(aes(group = 1), fun = mean, geom = "line",
               color = "red", linewidth = 1.5, linetype = "solid") +
  stat_summary(aes(group = 1), fun = mean, geom = "point",
               color = "red", size = 4, shape = 18) +
  scale_color_manual(values = c("steelblue", "orange")) +
  labs(title = "Milk Yield Before and After Treatment",
       subtitle = "Lines connect measurements from same cow",
       x = "",
       y = "Milk Yield (kg/day)") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

# Distribution of differences
p2 <- ggplot(milk_paired, aes(x = difference)) +
  geom_histogram(bins = 12, fill = "steelblue", alpha = 0.7, color = "white") +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = diff_summary$mean_diff, color = "darkgreen",
             linetype = "solid", linewidth = 1) +
  annotate("text", x = 0, y = 4.5, label = "No change",
           color = "red", hjust = -0.1, size = 3) +
  annotate("text", x = diff_summary$mean_diff, y = 4.5,
           label = sprintf("Mean\ndiff = %.1f", diff_summary$mean_diff),
           color = "darkgreen", hjust = -0.1, size = 3) +
  labs(title = "Distribution of Within-Cow Differences",
       x = "Difference in Milk Yield (kg/day)",
       y = "Count") +
  theme_minimal(base_size = 12)

p1 + p2

Key observation: Most lines slope upward, indicating that individual cows increased milk production. The distribution of differences is centered above zero.

19.2.2 Paired vs Unpaired Analysis

Let’s compare what happens if we (incorrectly) treat this as unpaired data:

Code
# Paired t-test (CORRECT)
paired_test <- t.test(milk_paired$after, milk_paired$before, paired = TRUE)

# Unpaired t-test (INCORRECT for this design)
unpaired_test <- t.test(milk_paired$after, milk_paired$before, paired = FALSE)

# Compare results
cat("PAIRED t-Test (Correct Analysis):\n")
PAIRED t-Test (Correct Analysis):
Code
cat(sprintf("  t(%d) = %.3f, p = %.4f\n",
            paired_test$parameter, paired_test$statistic, paired_test$p.value))
  t(19) = 4.553, p = 0.0002
Code
cat(sprintf("  95%% CI for difference: [%.2f, %.2f]\n\n",
            paired_test$conf.int[1], paired_test$conf.int[2]))
  95% CI for difference: [1.58, 4.28]
Code
cat("UNPAIRED t-Test (Incorrect Analysis):\n")
UNPAIRED t-Test (Incorrect Analysis):
Code
cat(sprintf("  t(%.1f) = %.3f, p = %.4f\n",
            unpaired_test$parameter, unpaired_test$statistic, unpaired_test$p.value))
  t(37.5) = 1.873, p = 0.0688
Code
cat(sprintf("  95%% CI for difference: [%.2f, %.2f]\n\n",
            unpaired_test$conf.int[1], unpaired_test$conf.int[2]))
  95% CI for difference: [-0.24, 6.10]
Code
cat("Why the difference?\n")
Why the difference?
Code
cat(sprintf("  Paired test SE: %.3f\n",
            diff_summary$sd_diff / sqrt(20)))
  Paired test SE: 0.644
Code
cat(sprintf("  Unpaired test SE: %.3f\n",
            sqrt(var(milk_paired$before)/20 + var(milk_paired$after)/20)))
  Unpaired test SE: 1.564
Code
cat("  → Paired test has smaller SE (removes between-cow variation)\n")
  → Paired test has smaller SE (removes between-cow variation)

Key insight: The paired test is more powerful because it accounts for the fact that we measured the same cows twice. The unpaired test includes unnecessary between-cow variability, making the standard error larger and the test less sensitive.

ImportantPairing Increases Power

In this example:

  • Paired test: p = 0.0002 → Significant at α = 0.05
  • Unpaired test: p = 0.0688 → Not significant at α = 0.05

The paired test is more powerful because we’re comparing each cow to itself, removing individual differences in baseline milk production.

Rule: If your data are paired by design, you MUST use a paired test!

19.2.3 Conduct Paired t-Test

Code
# Paired t-test
paired_result <- t.test(milk_paired$after, milk_paired$before, paired = TRUE)

# Tidy output
paired_tidy <- tidy(paired_result)

knitr::kable(paired_tidy, digits = 4,
             caption = "Paired t-Test Results")
Paired t-Test Results
estimate statistic p.value parameter conf.low conf.high method alternative
2.9309 4.5532 2e-04 19 1.5836 4.2782 Paired t-test two.sided
Code
# Print interpretation
cat("\nPaired t-Test Results:\n")

Paired t-Test Results:
Code
cat(sprintf("Mean difference: %.2f kg/day\n", paired_result$estimate))
Mean difference: 2.93 kg/day
Code
cat(sprintf("t(%d) = %.3f\n", paired_result$parameter, paired_result$statistic))
t(19) = 4.553
Code
cat(sprintf("P-value: %.4f\n", paired_result$p.value))
P-value: 0.0002
Code
cat(sprintf("95%% CI: [%.2f, %.2f]\n", paired_result$conf.int[1], paired_result$conf.int[2]))
95% CI: [1.58, 4.28]
Code
if(paired_result$p.value < 0.05) {
  cat("\n→ Significant at α = 0.05. Evidence that treatment increases milk yield.\n")
} else {
  cat("\n→ Not significant at α = 0.05. Insufficient evidence of treatment effect.\n")
}

→ Significant at α = 0.05. Evidence that treatment increases milk yield.

19.2.4 Effect Size for Paired t-Test

Code
# Cohen's d for paired data (based on differences)
paired_d <- cohen.d(milk_paired$after, milk_paired$before, paired = TRUE)

print(paired_d)

Cohen's d

d estimate: 0.5832096 (medium)
95 percent confidence interval:
    lower     upper 
0.3027274 0.8636919 
Code
cat(sprintf("\nCohen's d (paired): %.3f\n", paired_d$estimate))

Cohen's d (paired): 0.583
Code
# Interpretation
d_val <- abs(paired_d$estimate)
if(d_val < 0.2) {
  d_interp <- "negligible"
} else if(d_val < 0.5) {
  d_interp <- "small"
} else if(d_val < 0.8) {
  d_interp <- "medium"
} else {
  d_interp <- "large"
}

cat(sprintf("Effect size interpretation: %s\n", d_interp))
Effect size interpretation: medium

19.2.5 Assumptions for Paired t-Test

Paired t-test assumes:

  1. Independence of pairs (not within pairs)
  2. Normality of the differences (not the original measurements)
Code
# Check normality of DIFFERENCES
ggplot(milk_paired, aes(sample = difference)) +
  stat_qq(color = "steelblue", size = 2) +
  stat_qq_line(color = "red", linetype = "dashed") +
  labs(title = "Q-Q Plot of Differences",
       subtitle = "Checking normality assumption for paired t-test",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles (Differences)") +
  theme_minimal(base_size = 12)

Code
# Shapiro-Wilk test on differences
shapiro_diff <- shapiro.test(milk_paired$difference)
cat(sprintf("\nShapiro-Wilk test on differences: W = %.4f, p = %.3f\n",
            shapiro_diff$statistic, shapiro_diff$p.value))

Shapiro-Wilk test on differences: W = 0.9514, p = 0.389

The differences appear approximately normal, so the paired t-test is appropriate.

19.2.6 Reporting Paired t-Test Results

Example write-up:

Milk yield was measured in 20 dairy cows before and after 4 weeks of probiotic supplementation. A paired-samples t-test revealed that milk yield increased significantly after treatment (M_after = 34.1 kg/day, SD = 4.7) compared to before treatment (M_before = 31.1 kg/day, SD = 5.2), t(19) = 4.55, p < 0.001, 95% CI [1.6, 4.3], d = 0.58. The probiotic supplement increased milk production by an average of 2.9 kg/day, representing a medium effect.

20 Choosing the Right t-Test

Selecting the appropriate t-test comes down to two questions: how many groups are you comparing, and are the observations independent or paired?
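The decision logic can be sketched in a few lines of R (a toy helper for illustration, not part of the course code):

```r
# Toy helper: pick the appropriate t-test from the study design
choose_t_test <- function(n_groups, paired = FALSE) {
  if (n_groups == 1) {
    "one-sample t-test"          # compare one mean to a known value
  } else if (paired) {
    "paired t-test"              # same animals or matched pairs
  } else {
    "two-sample t-test (Welch's by default)"  # independent groups
  }
}

choose_t_test(1)                 # comparing to a breed standard
choose_t_test(2, paired = TRUE)  # before/after on the same cows
choose_t_test(2)                 # two independently assigned groups
```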

20.1 Common Scenarios in Animal Science

  • Compare a sample mean to a historical value → One-sample (e.g., is current milk yield different from the breed average?)
  • Compare two independent groups → Two-sample, independent (e.g., Feed A vs Feed B in randomly assigned pigs)
  • Compare before and after in the same animals → Paired (e.g., weight before and after medication in the same cattle)
  • Compare littermates or twins → Paired (e.g., twin calves assigned to different treatments)
  • Compare males vs females → Two-sample, independent (e.g., growth rate in male vs female lambs)
  • Compare left vs right in the same animal → Paired (e.g., udder health in left vs right quarters)
WarningCommon Mistake: Treating Paired Data as Independent

Don’t use a two-sample t-test when data are paired! This:

  1. Wastes information (ignores pairing)
  2. Reduces power (larger standard error)
  3. May lead to incorrect conclusions

If measurements are connected (same subject, matched pairs, siblings), use a paired test!

21 Practical Considerations

21.1 Checking Assumptions in Practice

21.1.1 Normality

Visual methods (preferred):

  • Histograms: Look for roughly symmetric, bell-shaped distribution
  • QQ plots: Points should fall close to the diagonal line
  • Density plots: Compare to normal curve overlay

Formal tests (use with caution):

  • Shapiro-Wilk test: shapiro.test()
  • Kolmogorov-Smirnov test: ks.test()
TipWhen Normality is Violated

If data are clearly non-normal:

  1. Transform the data: Log, square root, or Box-Cox transformation
  2. Use non-parametric tests: Wilcoxon rank-sum test (Mann-Whitney U) instead of two-sample t-test
  3. Bootstrap confidence intervals: Resample to estimate sampling distribution
  4. Rely on CLT: With large samples (n > 30), t-tests are robust to non-normality

Most common approach: If n > 30 and no extreme outliers/skewness, proceed with t-test.
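As a sketch of option 2, the Wilcoxon rank-sum test drops the normality assumption entirely. Simulated right-skewed data with hypothetical values:

```r
# Right-skewed data where the t-test's normality assumption is doubtful
set.seed(42)
group_a <- rlnorm(15, meanlog = 4.0, sdlog = 0.6)
group_b <- rlnorm(15, meanlog = 4.4, sdlog = 0.6)

# Option 2: Wilcoxon rank-sum (Mann-Whitney U) -- no normality assumed
wilcox_result <- wilcox.test(group_a, group_b)
wilcox_result

# Option 1 for comparison: log-transform, then the usual t-test
t_log <- t.test(log(group_a), log(group_b))
```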

21.1.2 Equal Variances

Visual methods:

  • Compare SD between groups (ratio < 2 is usually fine)
  • Compare boxplot heights and spreads

Formal test:

  • Levene’s test: car::leveneTest()

Default recommendation: Use Welch’s t-test (doesn’t assume equal variances) as your default. It’s robust and performs well even when variances are equal.
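A quick sketch of both checks on simulated data (hypothetical values). Note that base R's `var.test()` F test is used here as a stand-in; Levene's test via `car::leveneTest()` is more robust to non-normality:

```r
# Simulated weight-gain data for two groups (hypothetical values)
set.seed(7)
g1 <- rnorm(20, mean = 178, sd = 16)
g2 <- rnorm(20, mean = 198, sd = 18)

# Visual rule of thumb: SD ratio under ~2 is usually fine
sd_ratio <- max(sd(g1), sd(g2)) / min(sd(g1), sd(g2))
sd_ratio

# Formal check: base R's F test of equal variances
# (car::leveneTest() is preferable when normality is questionable)
f_test <- var.test(g1, g2)
f_test
```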

21.1.3 Independence

This is the most important and least testable assumption.

Violations occur when:

  • Observations are clustered (e.g., multiple measurements per animal)
  • Time series data with autocorrelation
  • Spatial dependence (e.g., neighboring pens)
  • Pseudo-replication (treating subsamples as independent)

Solutions:

  • Use mixed models to account for clustering
  • Aggregate repeated measures appropriately
  • Design studies to ensure independence
ImportantIndependence Cannot Be Fixed Post-Hoc

Unlike normality or equal variance assumptions, independence violations cannot be rescued with transformations or alternative tests. You must design your study correctly from the beginning.

Example of pseudo-replication: Taking 5 blood samples from each of 4 cows and analyzing n=20 samples is wrong. The samples within each cow are not independent. The true n is 4 cows, not 20 samples.
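In code, the fix is to aggregate to one value per experimental unit before testing. A base-R sketch with simulated values (hypothetical data):

```r
# 4 cows, 5 blood samples each -- samples within a cow are NOT independent
set.seed(11)
blood <- data.frame(
  cow     = rep(1:4, each = 5),
  trt     = rep(c("control", "treated"), each = 10),
  glucose = rnorm(20, mean = 60, sd = 5)
)

# WRONG: t.test(glucose ~ trt, data = blood)  # pretends n = 20

# RIGHT: collapse to one mean per cow, then test with the true n = 4
cow_means <- aggregate(glucose ~ cow + trt, data = blood, FUN = mean)
agg_test  <- t.test(glucose ~ trt, data = cow_means)
agg_test
```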

21.2 Effect Sizes and Practical Significance

Statistical significance ≠ Practical significance

A result can be statistically significant (p < 0.05) but:

  • The effect size is tiny
  • The difference is not economically or biologically meaningful
  • The cost of implementation outweighs the benefit

Always report:

  1. P-value: Strength of evidence against H₀
  2. Confidence interval: Range of plausible values for the effect
  3. Effect size: Standardized measure of magnitude (Cohen’s d)
  4. Means and SDs: Raw values for practical interpretation
NoteCohen’s d Interpretation (Guidelines)
  • d = 0.0 - 0.2 → Negligible (trivial difference)
  • d = 0.2 - 0.5 → Small (noticeable to researchers)
  • d = 0.5 - 0.8 → Medium (visible to the naked eye)
  • d ≥ 0.8 → Large (obvious, practically meaningful)

Remember: These are rough guidelines. What matters is context:

  • In medicine, small effects can be life-saving
  • In agriculture, medium effects must be cost-effective
  • In behavior, large effects are rare and important

21.3 Sample Size and Power Considerations

Before conducting your study, calculate required sample size:

Code
# Example: Planning a feed trial
# Expected difference: 0.15 kg/day weight gain
# Expected SD: 0.20 kg/day
# Desired power: 0.80
# Alpha: 0.05

power_calc <- power.t.test(
  delta = 0.15,
  sd = 0.20,
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample"
)

print(power_calc)

     Two-sample t test power calculation 

              n = 28.89962
          delta = 0.15
             sd = 0.2
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
Code
cat(sprintf("\nTo detect a difference of %.2f kg/day with 80%% power:\n", 0.15))

To detect a difference of 0.15 kg/day with 80% power:
Code
cat(sprintf("  Required sample size: %.0f animals per group\n", ceiling(power_calc$n)))
  Required sample size: 29 animals per group
Code
cat(sprintf("  Total animals needed: %.0f\n", 2 * ceiling(power_calc$n)))
  Total animals needed: 58

Trade-offs:

  • Smaller effect → Need larger sample
  • More variable data → Need larger sample
  • Higher power → Need larger sample
  • Lower α → Need larger sample
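The first two trade-offs can be seen directly by varying `delta` in `power.t.test()`; the 0.15 case reproduces the 29-per-group result above:

```r
# Required n per group as the detectable difference shrinks
# (SD = 0.20 kg/day, power = 0.80, alpha = 0.05, as in the feed trial)
deltas <- c(0.05, 0.10, 0.15, 0.20)
n_required <- sapply(deltas, function(d) {
  ceiling(power.t.test(delta = d, sd = 0.20, sig.level = 0.05,
                       power = 0.80, type = "two.sample")$n)
})
data.frame(delta = deltas, n_per_group = n_required)
```

Halving the detectable difference roughly quadruples the required sample size.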

21.4 Common Mistakes and Pitfalls

21.4.1 1. Multiple Comparisons (p-hacking)

Problem: Running many t-tests increases the chance of false positives.

Example: Testing 20 different outcomes. With α = 0.05, you expect 1 false positive even if there are no real effects.

Solutions:

  • Bonferroni correction: Divide α by number of tests (α_adjusted = 0.05 / 20 = 0.0025)
  • ANOVA followed by post-hoc tests (covered next week)
  • Pre-specify primary outcomes before analysis
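Base R's `p.adjust()` implements the Bonferroni correction (and less conservative alternatives); a sketch with hypothetical raw p-values:

```r
# Hypothetical raw p-values from five separate t-tests
raw_p <- c(0.003, 0.021, 0.040, 0.380, 0.650)

# Bonferroni: each p multiplied by the number of tests (capped at 1)
p_bonf <- p.adjust(raw_p, method = "bonferroni")

# Holm's step-down method: always at least as powerful as Bonferroni
p_holm <- p.adjust(raw_p, method = "holm")

round(data.frame(raw = raw_p, bonferroni = p_bonf, holm = p_holm), 3)
```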

21.4.2 2. Confusing Statistical and Practical Significance

Problem: With large samples, tiny effects become “significant”

Example: Weight gain differs by 0.5 kg (p = 0.03) but costs $50 more per animal → Not worth it!

Solution: Always consider effect size and cost-benefit

21.4.3 3. One-Tailed Tests Without Justification

Problem: Using one-tailed tests to achieve p < 0.05

Solution: Use two-tailed tests by default. Only use one-tailed if:

  • You have strong theoretical reason
  • Effect in opposite direction is impossible or meaningless
  • You pre-registered the hypothesis
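In `t.test()`, the choice is made through the `alternative` argument; a sketch with simulated egg weights (hypothetical data):

```r
# Hypothetical pre-registered hypothesis: mean egg weight EXCEEDS 62 g
set.seed(5)
eggs <- rnorm(30, mean = 64, sd = 5)

two_sided <- t.test(eggs, mu = 62)                          # default
one_sided <- t.test(eggs, mu = 62, alternative = "greater") # one-tailed

# When the sample mean lies in the hypothesized direction, the one-sided
# p-value is half the two-sided one -- the temptation this section warns about
c(two_sided = two_sided$p.value, one_sided = one_sided$p.value)
```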

21.4.4 4. Ignoring Assumptions

Problem: Running t-tests without checking assumptions

Solution: Always check:

  • Normality (QQ plots)
  • Equal variances (visual or Levene’s test)
  • Independence (by design)

21.4.5 5. Treating Paired Data as Independent

Problem: Using two-sample t-test for paired data

Solution: If same subjects measured twice or matched pairs → Use paired t-test!

21.5 Reporting t-Test Results

Essential elements:

  1. Descriptive statistics: Means, SDs, sample sizes for each group
  2. Test used: One-sample, two-sample (Welch’s or Student’s), or paired
  3. Test statistic: t-value and degrees of freedom
  4. P-value: Exact value (not just “< 0.05”)
  5. Confidence interval: For the difference
  6. Effect size: Cohen’s d
  7. Interpretation: In context of the research question

Example:

Milk yield increased significantly after probiotic supplementation (M = 32.5 kg/day, SD = 4.8) compared to baseline (M = 30.1 kg/day, SD = 4.6), t(19) = 4.12, p < 0.001, 95% CI [1.2, 3.6], d = 0.92. This represents a large effect and an average increase of 2.4 kg/day per cow.
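These elements can also be assembled programmatically. A sketch of a small helper (`report_ttest` is hypothetical, not a package function):

```r
# Hypothetical helper: format the core statistics from a t.test() result
report_ttest <- function(tt, d) {
  sprintf("t(%.1f) = %.2f, p = %.4f, 95%% CI [%.2f, %.2f], d = %.2f",
          tt$parameter, tt$statistic, tt$p.value,
          tt$conf.int[1], tt$conf.int[2], d)
}

# Simulated two-group example (hypothetical values)
set.seed(1)
x <- rnorm(20, mean = 32.5, sd = 4.8)
y <- rnorm(20, mean = 30.1, sd = 4.6)
tt <- t.test(x, y)
d  <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)  # Cohen's d, pooled SD

report_ttest(tt, d)
```

Remember to pair the formatted statistics with raw means, SDs, and a plain-language interpretation, as in the example above.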

22 Summary and Key Takeaways

22.1 What We Covered

  1. Hypothesis testing framework: Structured approach to evaluating claims about populations
  2. Type I and Type II errors: Understanding false positives (α) and false negatives (β)
  3. Statistical power: Probability of detecting real effects; influenced by effect size, sample size, α, and variability
  4. One-sample t-test: Comparing sample mean to known value
  5. Two-sample t-test: Comparing means between independent groups
  6. Paired t-test: Comparing means for related/matched observations
  7. Assumptions: Normality, equal variances, independence
  8. Effect sizes: Cohen’s d for standardized effect magnitude
  9. Practical significance: Statistical significance doesn’t guarantee practical importance

22.2 Key Principles

ImportantCore Principles of Hypothesis Testing
  1. P-values measure evidence, not truth: A p-value tells you how compatible your data are with H₀, not whether H₀ is true or false

  2. Effect sizes matter more than p-values: Always report and interpret effect sizes and confidence intervals

  3. Design determines analysis: Paired vs independent determines which test to use—this cannot be changed after data collection

  4. Assumptions matter: Check them, but also understand t-tests are fairly robust (especially with larger samples)

  5. Power drives sample size: Calculate required sample size BEFORE collecting data

  6. Context is everything: Statistical significance without practical significance is meaningless

22.3 Decision Framework

When analyzing data:

  1. Identify your research question: What are you comparing?
  2. Choose appropriate test:
    • One sample? → One-sample t-test
    • Two independent groups? → Two-sample t-test
    • Paired/matched data? → Paired t-test
  3. Check assumptions: Normality (QQ plots), equal variances (visual or Levene’s), independence (by design)
  4. Conduct test: Calculate t-statistic and p-value
  5. Calculate effect size: Cohen’s d for standardized magnitude
  6. Interpret in context: Consider both statistical and practical significance
  7. Report completely: Means, SDs, n, test statistics, p-values, CIs, effect sizes

22.4 Next Week Preview

Week 5: Analysis of Variance (ANOVA)

  • Comparing more than two groups (extension of t-tests)
  • Understanding variance partitioning (between-group vs within-group)
  • Post-hoc tests (Tukey HSD, Bonferroni)
  • Multiple comparisons problem
  • When to use ANOVA vs multiple t-tests
NoteComing Full Circle

ANOVA is mathematically equivalent to the t-test when comparing two groups. Next week, we’ll see how to generalize hypothesis testing to multiple groups while controlling Type I error rates.

23 Practice Problems

23.1 Problem 1: One-Sample Scenario

A poultry researcher measures egg weight from 30 hens. The breed standard is 62 grams. The sample mean is 64.5 grams with SD = 5.2 grams. Is there evidence that egg weight differs from the breed standard?

Tasks:

  1. State the null and alternative hypotheses
  2. Conduct a one-sample t-test
  3. Calculate Cohen’s d
  4. Interpret the results
Code
# Your code here
set.seed(999)
egg_weights <- rnorm(30, mean = 64.5, sd = 5.2)

# a) Hypotheses: H0: μ = 62, H1: μ ≠ 62

# b) Test
test1 <- t.test(egg_weights, mu = 62)
print(test1)

# c) Effect size
d1 <- (mean(egg_weights) - 62) / sd(egg_weights)
cat(sprintf("Cohen's d: %.3f\n", d1))

# d) Interpret...

23.2 Problem 2: Two-Sample Scenario

A beef cattle trial compares average daily gain (ADG) for two protein sources. Soybean meal (n=25): M=1.45 kg/day, SD=0.22. Corn gluten (n=25): M=1.38 kg/day, SD=0.19. Is there a significant difference?

Tasks:

  1. State hypotheses
  2. Check equal variance assumption
  3. Conduct two-sample t-test (Welch’s)
  4. Calculate effect size
  5. Provide practical interpretation
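A starter sketch, following the pattern of Problems 1 and 4 (it simulates raw data matching the stated summary statistics; your simulated sample moments will differ slightly from the targets):

```r
# Your code here -- starter: simulate raw data matching the summaries
set.seed(123)
soy  <- rnorm(25, mean = 1.45, sd = 0.22)  # soybean meal ADG (kg/day)
corn <- rnorm(25, mean = 1.38, sd = 0.19)  # corn gluten ADG (kg/day)

# b) Rough equal-variance check
sd(soy) / sd(corn)

# c) Welch's two-sample t-test (the default)
welch_result <- t.test(soy, corn)
welch_result

# d) Effect size, e.g. with effsize::cohen.d(soy, corn)
```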

23.3 Problem 3: Paired Scenario

A veterinarian measures body temperature in 15 calves before and after administering an anti-inflammatory drug. Should you use a paired or unpaired test? Why? What are the hypotheses?

23.4 Problem 4: Power Analysis

You’re planning a study to detect a 10% difference in weaning weight (Expected means: 250 kg vs 275 kg, SD = 30 kg). How many calves do you need per group to achieve 80% power with α = 0.05?

Code
# Your code here
power.t.test(
  delta = 25,
  sd = 30,
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample"
)

23.5 Problem 5: Assumption Checking

You’ve collected data from two groups but aren’t sure if assumptions are met. What plots would you create? What tests would you run? What would you do if assumptions are violated?


24 Additional Resources

24.1 R Functions Covered

  • t.test() (base): One-sample, two-sample, and paired t-tests
  • power.t.test() (base): Power and sample size calculations
  • shapiro.test() (base): Test for normality
  • leveneTest() (car): Test for equality of variances
  • cohen.d() (effsize): Calculate Cohen's d effect size
  • tidy() (broom): Tidy statistical output

24.2 Further Reading

  • Cumming, G. (2012). Understanding the New Statistics. Excellent on effect sizes and confidence intervals
  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Classic reference on power
  • Lakens, D. (2013). “Calculating and reporting effect sizes to facilitate cumulative science.” Frontiers in Psychology.
  • American Statistical Association (2016). “Statement on P-values and Statistical Significance.”

24.3 Online Resources

  • R for Data Science (2e): https://r4ds.hadley.nz/
  • Statistical Thinking: https://www.fharrell.com/
  • StatQuest Videos (YouTube): Excellent visual explanations of t-tests and power

End of Week 4 Lecture

Next week: Analysis of Variance (ANOVA) - extending t-tests to multiple groups!